Chapter 6: Addressing & URLs


Before I can tell you about email, retrieving files via FTP, browsing the Web, or much of anything else, I must discuss how email addresses and machine names are formed, where they come from, and that sort of thing. Along with these details about email addresses and machine names, you must learn about URLs, or Uniform Resource Locators. URLs provide a coherent method of uniquely identifying resources on the Internet, ranging from Web pages to files available via FTP to WAIS sources. But let's start at the beginning...

Addressing

A rose may be a rose by any other name, but the same is not true of an Internet computer. All Internet computers think of each other in terms of numbers (not surprisingly), and all people think of them in terms of names (also not surprisingly). The Internet uses the domain name system to make sense of the millions of machines that make up the Internet. In terms of the numbers, each machine's address is composed of four numbers, each less than 256. People are generally bad about remembering more than the seven digits of a phone number, so the folks working in this field came up with a program called a domain name server. Domain name servers translate between the numeric addresses and the names; real people can remember and use the names while real computers can continue to refer to each other by number. That way, everyone is happy.

Note: Domain name servers, although generally part of the background technology that enables the Internet to work seamlessly, are tremendously important. Without them, very little on the Internet works these days.
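
To make the numeric side of this concrete, here is a minimal sketch in Python that checks the "four numbers, each less than 256" rule. The function name parse_dotted_quad is invented for illustration, and the sample address is the one this chapter quotes for king.tidbits.com. It doesn't contact a domain name server; it only looks at the shape of the address.

```python
def parse_dotted_quad(address):
    """Split a numeric address such as '204.57.157.13' into four integers."""
    parts = address.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four numbers, got {len(parts)}")
    numbers = []
    for part in parts:
        if not part.isdigit():
            raise ValueError(f"not a number: {part!r}")
        value = int(part)
        if value > 255:
            raise ValueError(f"{value} is not less than 256")
        numbers.append(value)
    return numbers


print(parse_dotted_quad("204.57.157.13"))  # [204, 57, 157, 13]
```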

Despite the fact that all Internet numeric addresses are sets of four numbers, the corresponding name can have between two and five sets of words. After five, it gets out of hand, so although it's possible, it's not generally done. For instance, one of the machines I use now is called king.tidbits.com (three words), and the machine I used at Cornell was called cornella.cit.cornell.edu (four words). The domain style addresses may look daunting, but in fact they are quite easy to work with, especially when you consider the numeric equivalents, such as 204.57.157.13 for king.tidbits.com. Each item in those addresses, separated by the periods, is called a domain, and in the following sections, you are going to look at them backward, or in terms of the largest domain to the smallest.

Note: A random aside for those of you who are students of classical rhetoric: The process of introducing topics A, B, and C, and then discussing them in the order C, B, and A is called chiasmus. This little known fact is entirely unrelated to the Internet, except that after the first edition of this book I took a lot of good-natured ribbing on the Internet about my classical education, so I figured I should at least pretend to know something about the topic.

Top-level Domains

In any machine name, the final word after the last dot is the top-level domain, and a limited number of them exist. Originally, and this shows the Internet's early Americo-centric view, six top-level domains indicated to what type of organization the machine belonged. Thus, we ended up with the following list:

com = commercial
edu = educational
org = organization, usually nonprofit
mil = military
net = network
gov = government

That setup was all fine and dandy for starters, but as the number of machines on the Internet began to grow at an amazing rate, a more all-encompassing solution became necessary. The new top-level domains are based on countries, so each country has its own two-letter domain. Thus, the United Kingdom's top-level domain is uk, Sweden's is se, Japan's is jp, Australia's is au, and so on. Every now and then another country comes on the Internet, and I see a domain code that totally throws me, as Iceland's code, is, did the first time.

Note: If you'd like to see the complete list of country codes, check out this URL:
http://www.nw.com/zone/iso-country-codes

The United States has this system, too; so, for example, The Well, a popular commercial service with links to the Internet, is well.sf.ca.us. Unfortunately, because so many sites already existed with the old domain names, it made no sense to change them. Thus, we have both types of top-level domain names here in the U.S., and you just have to live with it.

You may see a couple of other top-level domains on occasion, bitnet and uucp, such as in listserv@bitnic.bitnet or ace@tidbits.uucp. In both of these cases, the top-level domain indicates that the machine is on one of the alternative networks and may not exist directly on the Internet (otherwise, it would have a normal top-level domain such as com or uk). This setup isn't a big deal these days because so many machines exist on two networks that your email gets through just fine in most cases. In the past, though, few connections existed between the Internet and BITNET or Usenet, so getting mail through one of the existing gateways was more difficult. Keep in mind that because a machine whose name ends with bitnet or uucp is not usually on the Internet, you cannot use Telnet or FTP with it.

Many machine names are as simple as it gets: a machine name and a top-level domain. Others are more complex because of additional domains in the middle. Think of an address such as cornella.cit.cornell.edu as one of those nested Russian dolls (see figure 6.1). The outermost doll is the top-level domain, the next few dolls are the mid-level domains, and, if you go all the way in, the final doll is the userid (which I'll explain soon enough).

Figure 6.1: The Russian doll approach to Internet addresses.

Mid-level Domains

What do these mid-level domains represent? It's hard to say precisely, because the answer can vary a bit. The machine I used at Cornell, known as cornella.cit.cornell.edu, represents one way the mid-level domains have been handled. The machine name is cornella, and the top-level domain is edu, because Cornell claims all those undergraduates are there to get an education. The cit after cornella is the department, Cornell Information Technologies, that runs the machine known as cornella. The next part, cornell, is obvious; it's the name of the overall organization to which CIT belongs. So, for this machine anyway, the hierarchy of dolls is, in order, machine name, department name, organization name, and organization type.

This is similar to how my system is set up now, since I control the tidbits.com domain, and each of my Macs has a name within that domain. So, for instance, my desktop Mac is called penguin.tidbits.com, and my server is king.tidbits.com.

In the machine name for The Well, well.sf.ca.us, you see a geographic use of mid-level domains. In this case, well is the machine name, sf is the city name (San Francisco), ca is the state name (California), and us is the country code for the United States.

Mid-level domains spread the work around. Obviously, the Internet can't have machines with the same name; otherwise, chaos would erupt. But because the domain name system allows for mid-level domains, the administrators for those mid-level domains must make sure that everyone below them stays unique. In other words, I could actually name my machine cornella.tidbits.com because that name is completely different from cornella.cit.cornell.edu (though why I'd want to, I don't know). And, if they wanted, the administrators at CIT could put a new machine on the net and call it tidbits.cit.cornell.edu without any trouble, for the same reason. More importantly, the administrators don't need to bother anyone else if they want to make that change. They control the cit domain, and as long as all the machines within that domain have unique names, there aren't any problems. Of course, someone has to watch the top-level domains because it's all too likely that two people may want tidbits.com as a domain (but I've already got it, so they can't have it). That task is handled by the Internet Network Information Center, or InterNIC. As a user, you shouldn't have to worry about naming problems, because everyone should have a system administrator who knows who to talk to, and you need the cooperation of your provider anyway -- you can't set up a domain on your own.

There is yet another way to handle the mid-level domains, this time in terms of intermediate computers. Before I got my current address, I had a connection from a machine called halcyon, whose full name was halcyon.com. My machine name was tidbits.halcyon.com. In this case, tidbits was my machine name, halcyon was the machine through which all of my mail was routed, and com indicated that the connection was through a commercial organization. I realize that this example is a bit confusing, but I mention it because it's one way that you can pretend to have an Internet address when you really have only a UUCP connection (a different sort of connection that transfers only email and news). All my mail and news came in via UUCP through halcyon, so by including halcyon in my address, I created an Internet-style address.

The other way of pretending that a UUCP connection is a real Internet connection for address purposes is to have your host set up an MX record (where MX stands for Mail Exchange). An MX record is a pointer on several true Internet machines to your site.

Machine Name

The next part in the full domain name is the machine itself; for example, in the name penguin.tidbits.com, penguin is the name of my machine. In my case, the machine is a Macintosh 660AV, but people use all sorts of machines, and because the system administrators often are a punchy, overworked lot, they tend to give machines silly names. Large organizations with more centralized control lean more toward thoroughly boring names, like the machine at Cornell, which was called cornella (as opposed to cornellc and cornelld and cornellf).

Note: For those who are wondering, the naming scheme I use is based on the names of species of penguins. Also, if you're wondering why you can send email to ace@tidbits.com if my machine is really called penguin.tidbits.com, it's because of the magic of the domain name system. Since most people like shorter addresses, it's common to map the shorter domain name, tidbits.com, to point to the server that handles mail specifically, king.tidbits.com in my case. Then, I set Eudora to look for mail on king.tidbits.com and everything works swimmingly. These are the sort of machinations that Internet providers continually deal with. Luckily, you as the user can usually ignore them.

One of the reasons for boring names is that in the early days, machines on BITNET had to have names with between six and eight characters. Coming up with a meaningful unique name within that restriction became increasingly difficult. Usenet doesn't put a limit on the length of names, but it requires that the first six characters be unique. Currently, the Internet allows the second level domain to be up to 24 characters, and the third level domain can be up to 72 characters. In no case can the full domain name go over 256 characters, however.

If you remember that machines often exist on the Internet as well as on one of these other networks, thereby blurring the distinctions, you'll see the problem. The limitations of Internet machine names are less rigid, so alternative connections dictate what names are acceptable.

Often, special services keep their names even when they move to different machines or even different organizations. Because of this situation, a machine that runs a service may have two names, one that goes with the machine normally and one that points solely at that service. For instance, the anonymous FTP site that I use to store all the software I talk about in this book is called ftp.tidbits.com. But in fact, it runs on a machine called ftp.halcyon.com, and I could move it to any other machine while still retaining the ftp.tidbits.com name. This situation is not a big deal one way or another.

To summarize, you can have multiple domains in a machine name, and the further you go to the right, the more general they become, often ending in the country code. Conversely, the further you go to the left, the more specific the domains become, ending in the machine name because it's the most specific.
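
That right-to-left hierarchy can be sketched in a few lines of Python. This is only an illustration: the function name describe_machine_name and the dictionary keys are made up for this example, and it assumes the simple case described above, where the leftmost item is the machine name and the rightmost is the top-level domain.

```python
def describe_machine_name(name):
    """Break a machine name into machine, mid-level, and top-level parts."""
    domains = name.split(".")
    if len(domains) < 2:
        raise ValueError(f"expected at least two domains in {name!r}")
    return {
        "machine": domains[0],        # most specific, leftmost
        "mid_level": domains[1:-1],   # may be empty, as in halcyon.com
        "top_level": domains[-1],     # most general, rightmost
    }


print(describe_machine_name("cornella.cit.cornell.edu"))
# {'machine': 'cornella', 'mid_level': ['cit', 'cornell'], 'top_level': 'edu'}

# Reading from the largest domain to the smallest, as this chapter does:
print(list(reversed("well.sf.ca.us".split("."))))
# ['us', 'ca', 'sf', 'well']
```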

But what about email addresses, which have userids? They're even more specific than machine names, since you can have many userids on a single machine.

Userid

Now that you've looked at the machine name, you can move on to the userid or username, which identifies a specific user on a machine. Both terms are equally correct (with two exceptions -- the commercial online service GEnie and the FirstClass BBS software both treat userids and usernames separately) and commonly used. If you set up your own machine, or work with a sufficiently flexible provider, you can choose your own username. Choosing your own name is good because then your correspondents can more easily remember your address, assuming of course that you choose a userid that makes sense and is easy to type. If I made my address ferdinand-the-bull@tidbits.com, people who typed the address slightly wrong and had their mail bounced back to them would become irritated at me.

Unlike Macintosh filenames (and America Online and eWorld userids), Internet userids cannot have spaces in them, so convention dictates that you replace any potential spaces with underscores, dashes, or dots, or omit them entirely. Other reasonable userids that I could use (but don't) include adam_engst@tidbits.com or adam-engst@tidbits.com or adam.engst@tidbits.com or adamengst@tidbits.com. However, all of these names are more difficult to type than ace@tidbits.com, and because I have good initials, I stick with them.

Unfortunately, there are a limited number of possible userids, especially at a large site. So Cornell, for instance, with its thousands of students and staff, has opted for a system of using initials plus one or more digits (because initials aren't all that unique, either -- in fact, I once asked for my initials as a userid on one of Cornell's mainframes and was told that ACE was a reserved word in that machine's operating system, though no one could tell me what it was reserved for).

Microsoft uses yet a different scheme: first name and last initial (using more than one initial to keep the userids unique). As Microsoft has grown, common names such as David have been used up, so the company has started other schemes such as first initial and last name. Why am I telling you this? Because knowing an organization's scheme can prove useful at times if you're trying to figure out how to send mail to someone at that organization, and so that I can note a societal quirk. At places like Microsoft where people use email so heavily, many folks refer to each other by email names exclusively. When my wife, Tonya, worked at Microsoft, she had a problem with her username, tonyae (first name and last initial) because it looked more like TonyAe than TonyaE to most people.

The real problem with assigned userids comes when the scheme is ludicrously random. Some universities work student ID numbers into the userid, for instance, and CompuServe userids are mere strings of digits like 72511,306. I believe the scheme has something to do with octal numbers or some such technoweenie hoo-hah. I don't speak octal or septal, or any such nonsense, and as a result, I can never remember CompuServe userids.

Remember that email addresses point at an individual, but when you're using services such as Telnet or FTP, no individual is involved. You simply want to connect to that machine, and you have to connect sans userid. This restriction may seem obvious, but it often trips people up until they get used to it. For example, it seems that you could just FTP to anonymous@space.alien.com. The system doesn't work that way, though, and you FTP to space.alien.com, and once there, log in as anonymous. More about FTP in later chapters.
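
The difference between an email address and a bare machine name is easy to show in code. The following Python sketch (the name split_address is hypothetical, chosen for this example) separates the userid from the machine name at the @ symbol, and refuses anything that lacks one or the other.

```python
def split_address(address):
    """Split 'userid@machine' into its two halves."""
    userid, at, machine = address.rpartition("@")
    if not at or not userid or not machine:
        raise ValueError(f"{address!r} is not of the form userid@machine")
    return userid, machine


print(split_address("ace@tidbits.com"))  # ('ace', 'tidbits.com')

# For FTP or Telnet you want only the machine half; the userid
# (anonymous, say) is supplied after you connect.
userid, machine = split_address("anonymous@space.alien.com")
print(machine)  # space.alien.com
```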

Punctuation

Enough about userids. What about all this punctuation? Better known as Shift-2 (on U.S. keyboards anyway), the @ symbol came into use, I imagine, because it's a single character that generally means "at" in traditional usage. The @ symbol is nearly universal for Internet email, but not all types of networks have always used it. For instance, some BITNET machines once required you to spell out the word, as in the command TELL LISTSERV AT BITNIC HELP. Luckily, almost everything uses the @ symbol with no spaces these days, which reduces four characters to one, and probably has saved untold person-hours worth of typing over the years.

As long as you're learning about special characters, look at the dot. It is, of course, the period character on the keyboard, and it serves to separate the domains in the address. For various reasons unknown to me, the periods have become universally known as dots in the context of addresses. When you tell someone your email address over the phone, you say (or rather I'd say because it's my address), "My email address is ace at tidbits dot com." The other person must know that "at" equals the @ symbol and that "dot" equals the period. If he's unsure, explain yourself.

Alternative Addresses

You may see two other styles of addressing mail on the Internet, both of which work to sites that aren't actually on the Internet itself. The first, and older, of the two is called bang addressing. It was born in the early days when there were relatively few machines using UUCP. Not every machine knew how to reach every other machine, so the trick was to get the mail out to a machine that knew about a machine that knew about a machine that knew about your machine. Talk about a friend of a friend! So, you could once have sent email to an address that looked like uunet!nwnexus!caladan!tidbits!ace. This address would have sent the mail from uunet to nwnexus to caladan to tidbits and finally to my userid on tidbits. This approach assumes that your machine knows about the machine uunet (run by the commercial provider UUNET) and that all of the machines in the middle are up and running. All the exclamation points are called "bangs," appropriately enough, I suppose. On the whole, this style of addressing is slow and unreliable these days, but if you use a machine that speaks UUCP, you can occasionally use it to your advantage. For instance, every now and then, I try to send email to a machine that my UUCP host, nwnexus.wa.com, for some reason can't reach. By bang-routing the mail appropriately, I can make another Internet machine try to send the mail out, sometimes with greater success.
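
If it helps to see the friend-of-a-friend chain spelled out, here is a small Python sketch that takes a bang address apart. The function name parse_bang_path is invented for this example; it only splits the text and makes no attempt to check whether any of the machines exist.

```python
def parse_bang_path(path):
    """Split a bang address into its list of relay machines and the userid."""
    *machines, userid = path.split("!")
    if not machines or not userid:
        raise ValueError(f"{path!r} is not a bang address")
    return machines, userid


machines, userid = parse_bang_path("uunet!nwnexus!caladan!tidbits!ace")
print(machines)  # ['uunet', 'nwnexus', 'caladan', 'tidbits']
print(userid)    # ace
```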

The other sort of special addressing is another way to get around the fact that your machine, or even your network, isn't connected to the Internet as such. In this case, you must provide two addresses: one to get to the machine that feeds your machine, and one to get to your machine. The problem here is that Internet addresses cannot have more than one @ symbol in them. You can replace the first @ symbol with a % symbol, and the mailers then try to translate the address properly. My old address, ace@tidbits.halcyon.com, also could have been ace%tidbits.uucp@halcyon.com. These tricks are ugly and awkward, but sometimes necessary. Luckily, as the Internet grows and standardizes, you need fewer and fewer of these addressing tricks.
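
As a rough illustration of what "translating the address properly" might involve, the Python sketch below peels one layer off a % address. It assumes the common convention that the machine after the @ symbol receives the mail first and then turns the last % into an @ before passing the message along; actual mailers may behave differently, and the function name unwrap_percent_address is made up for this example.

```python
def unwrap_percent_address(address):
    """Return (relay_machine, forwarding_address) for a %-style address."""
    local_part, at, relay_machine = address.rpartition("@")
    if not at:
        raise ValueError(f"no @ symbol in {address!r}")
    if "%" not in local_part:
        # Nothing to unwrap: the relay machine is the final destination.
        return relay_machine, None
    userid, _, inner_machine = local_part.rpartition("%")
    return relay_machine, f"{userid}@{inner_machine}"


print(unwrap_percent_address("ace%tidbits.uucp@halcyon.com"))
# ('halcyon.com', 'ace@tidbits.uucp')
```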

Enough on machine names and email addresses, then. If you keep the previous discussions in mind when you're using the Internet, you shouldn't be confused by any address you see. And if you are confused, perhaps that address is seriously malformed. I've seen it happen before.

URLs

Before I talk about any of the various TCP/IP-based Internet services, I want to explain URLs, or Uniform Resource Locators. These constitute the most common and efficient method of telling people about resources available via FTP, the World Wide Web, and other Internet services. URLs have become so popular that the Library of Congress has even added a subfield for them when it catalogs electronic resources.

Note: URL generally stands for Uniform Resource Locator, although some people switch "uniform" for "universal." Despite what I've heard from one source, I have never heard anyone pronounce URL as "earl"; instead, everyone I've talked to, including one person from CERN who helped develop the World Wide Web, spells out the letters.

What are URLs?

A URL uniquely specifies the location of something on the Internet, using three main bits of information that you need in order to access any given object. First is the URL scheme, or the type of server making the object available, be it an FTP, Gopher, or World Wide Web server. Second, comes the address of the resource. Third and finally, there's the full pathname or identifier for the object, if necessary.

Note: Don't worry if I talk about Internet services that you haven't read about in detail yet. That's what the next few chapters are for, but I wanted to explain the way that people (including me in this book) provide pointers to specific resources available via the various services like FTP, Gopher, and the World Wide Web.

This description is slightly oversimplified, but the point I want to make is that URLs are an attempt to provide a consistent way to reference objects on the Internet. I say "objects" because you can specify URLs not only for files and Web pages, but also for stranger things, such as email addresses, Telnet sessions, and Usenet news postings.

Table 6.1 shows the main URL schemes that you're likely to see.

           Table 6.1: Common URL Schemes

        Scheme     Internet Protocol                 Sample Client

        ftp        File Transfer Protocol            Anarchie
        gopher     Gopher protocol                   TurboGopher
        http       HyperText Transfer Protocol       MacWeb
        mailto     Simple Mail Transfer Protocol     Eudora
        news       Network News Transfer Protocol    NewsWatcher
        wais       Wide Area Information Servers     MacWAIS

URL Construction

If you see a URL that starts with ftp, you know that the file specified in the rest of the URL is available via FTP, which means that you could use FTP under Unix, FTP via email, or a MacTCP-based FTP client such as Anarchie to retrieve it. If the URL starts with gopher, use TurboGopher or another Gopher client. If it starts with http, use MacWeb, NCSA Mosaic, or Netscape or some other Web browser. And, finally, if a URL starts with wais, you can use MacWAIS or another WAIS client.
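
The "look at the scheme, pick the program" rule amounts to a simple lookup. Here is a hedged Python sketch of it; the table simply repeats the sample clients from Table 6.1, and the names SAMPLE_CLIENTS and client_for are invented for this example rather than taken from any real program.

```python
# Scheme-to-client pairs copied from Table 6.1 (sample clients only).
SAMPLE_CLIENTS = {
    "ftp": "Anarchie",
    "gopher": "TurboGopher",
    "http": "MacWeb",
    "mailto": "Eudora",
    "news": "NewsWatcher",
    "wais": "MacWAIS",
}


def client_for(url):
    """Return the sample client for a URL's scheme, or None if unknown."""
    scheme, colon, _rest = url.partition(":")
    if not colon:
        raise ValueError(f"{url!r} has no scheme")
    return SAMPLE_CLIENTS.get(scheme.lower())


print(client_for("ftp://ftp.tidbits.com/pub/tidbits/"))  # Anarchie
print(client_for("mailto:president@whitehouse.gov"))     # Eudora
```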

Note: You can use a Web browser to access most of the URL schemes in Table 6.1, although Web browsers are not necessarily ideal for anything but information on the World Wide Web itself. Web browsers work pretty well for accessing files on Gopher servers and via gateways to WAIS databases, but FTP via a Web browser is clumsy (and may fail entirely with certain types of files, such as self-extracting archives). Similarly, although it's handy to use mailto URLs to send mail, I dislike doing so because then I don't have a record of my outgoing mail, as I do when I send mail from Eudora. And, no Web browser stands up to NewsWatcher in terms of news capabilities.

After the URL scheme comes a colon (:), which delimits the scheme from what comes next. If two slashes (//) come next, they indicate that an Internet machine name will follow, such as with http://www.apple.com/ or ftp://ftp.info.apple.com/. However, if the URL points at an address in some other format, such as an email address like mailto:president@whitehouse.gov, the slashes aren't appropriate and don't appear.

Note: In some rare circumstances, you may need to use a username and password in an FTP URL as well. A URL with a username and password might look like this:
ftp://username:password@domain.name/pub/

The last part of the URL is the specific information that you're looking for, be it an email address or more commonly, the path to the directory of the file that you desire. Directory names are separated from the machine name by a slash (/). You may not have to specify the path with some URLs if you're only connecting to the top level of the site.

So, for instance, let's dissect a URL that points at the Product Support page on Apple's Web server:

http://www.apple.com/documents/productsupport.html

First off, the http part tells us that we should use a Web browser to access this URL. Then, www.apple.com is the name of the host machine that's running the Web server. The next part, /documents/productsupport.html, is the full path to the file the Web browser shows us, so /documents is a directory, and productsupport.html is the actual file inside the /documents directory.
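
The same dissection can be done by a program. As a sketch, Python's standard urllib.parse module will split a URL into these pieces; the wrapper function dissect_url and the dictionary keys are my own choices for illustration, not part of any URL specification.

```python
import posixpath
from urllib.parse import urlsplit


def dissect_url(url):
    """Split a URL into scheme, machine name, directory, and file name."""
    parts = urlsplit(url)
    directory, filename = posixpath.split(parts.path)
    return {
        "scheme": parts.scheme,
        "machine": parts.hostname,  # None when no // section is present
        "directory": directory,
        "file": filename,
    }


print(dissect_url("http://www.apple.com/documents/productsupport.html"))
# {'scheme': 'http', 'machine': 'www.apple.com',
#  'directory': '/documents', 'file': 'productsupport.html'}
```

A URL without the two slashes, such as mailto:president@whitehouse.gov, comes back with no machine name at all, which matches the description above.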

If an FTP or Gopher URL ends with a slash, that always means it points at a directory and not a file. If it doesn't end with a slash, it may or may not point at a directory. If it's not obvious from the last part of the path, there's no good way of telling until you go there. Thus, this URL points at a directory and will return the directory listing of the files there:

ftp://ftp.tidbits.com/pub/tidbits/

However, this URL points directly at a file:

ftp://ftp.tidbits.com/pub/tidbits/issues/1990/TidBITS#001_16-Apr-90.etx

Because most Web servers enable the creation of a default file that serves in the absence of a specific file in the URL, it's usually less important for Web users to realize whether or not they're specifying a file or a directory. In other words,

http://www.tidbits.com/tidbits/index.html

points at a file, but the Web server running on that machine will display the same file (since it's the default), if you simply used this URL:

http://www.tidbits.com/tidbits/

Using URLs

All of these details aside, how do you use URLs? Your mileage may vary, but I use them in three basic ways. First, if I see them in email or in a Usenet posting, I often copy and paste the host part into Anarchie (if they are FTP URLs), or I paste the whole thing into MacWeb or Netscape (if any other scheme). That's the easiest way to retrieve a file or connect to a site if you have a MacTCP-based Internet connection.

Note: Actually, thanks to some slick programming, all I'd really do is Command-click on the URL in NewsWatcher, say, and it would automatically transfer that URL to the appropriate client program, Anarchie, TurboGopher, MacWeb, or whatever.

Second, if for some reason I don't want to use MacWeb or Netscape (I far prefer Anarchie for FTP, for instance), sometimes I manually dissect the URL, as we did with the Product Support page on the Apple Web server, to figure out which program to use and where to go. This method takes more work, but sometimes pays off in the end. (You can put a screw in the wall with a hammer, but it's not the best tool for the job.)

Third and finally, whenever I want to point people to a specific Internet resource or file available for anonymous FTP, I give them a URL. URLs are unambiguous, and although a bit ugly in running text, easier to use than attempting to spell out what they mean. Consider the example below:

ftp://ftp.tidbits.com/pub/tidbits/issues/1995/TidBITS#261_30-Jan-95.etx

To verbally explain the same information contained in that URL, I would have to say something like: "Using an FTP client program, connect to the anonymous FTP site ftp.tidbits.com. Change directories into the /pub/tidbits/issues/1995/ directory, and once you're there, retrieve the file TidBITS#261_30-Jan-95.etx." A single URL enables me to avoid such convoluted (and boring) language; and frankly, URLs are in such common use on the Internet, you may as well get used to seeing them right now.

Note: URLs sometimes have to break between two lines in publications. If you see a two-line URL that doesn't look quite right, stick the two lines back together when you're typing or pasting it into a Web browser, perhaps without a hyphen that might have been introduced in the production process.

So, from now on, whenever I mention a file available via FTP or a Web site, I'll use a URL. If you try to retrieve a file or connect to a Web site and are unsuccessful, chances are either you've typed the URL slightly wrong, or the file or server no longer exists. It's extremely likely that many of the files I give URLs for will have been updated by the time you read this, so the file name at the end of the URL may have changed.

So if a URL doesn't work, and this is a general piece of good advice, try removing the file name from the last part of the URL and look in the directory that the original file lived in for the updated file. If all else fails, you can remove everything after the machine name and work your way down to the file you are after.
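
That advice is mechanical enough to sketch in Python. The function below, whose name fallback_urls is invented for this example, lists the URLs you'd try in order: first the directory the file lived in, then each directory above it, and finally the top level of the machine. It only builds the list; it doesn't connect to anything.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def fallback_urls(url):
    """List progressively shorter URLs to try when the original fails."""
    parts = urlsplit(url)
    path = parts.path
    candidates = []
    while path not in ("", "/"):
        # Drop the last item (file or directory) from the path.
        path = posixpath.dirname(path.rstrip("/"))
        if not path.endswith("/"):
            path += "/"
        candidates.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return candidates


for candidate in fallback_urls("http://www.tidbits.com/tidbits/index.html"):
    print(candidate)
# http://www.tidbits.com/tidbits/
# http://www.tidbits.com/
```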

If, after all this, you'd like to learn more about the technical details behind the URL specifications, check out:

http://www.w3.org/hypertext/WWW/Addressing/URL/Overview.html

Weird Characters

There is one rather messy part to URLs that you don't usually have to deal with, but that comes up on occasion, most commonly in relation to Gopher URLs. There are certain characters that cannot appear in certain parts of a URL, including spaces. If one of those characters would otherwise appear, it's replaced with what's called an escape code, consisting of a percent symbol and the hexadecimal number corresponding to that character.

The reason this comes up most often in relation to Gopher URLs is that Gopher allows extremely long titles for files and directories, and allows pretty much any character within them, including spaces, slashes, question marks, and so on. So a Gopher URL may look a bit like this:

gopher://gopher.tc.umn.edu/11/Information%20About%20Gopher

Notice all the %20 escape codes that stand in for what are spaces on the real Gopher menu title.

For the most part, you don't have to worry about the way the spaces and other characters (see Table 6.2 for a list of some common ones that will show up in a URL as escape codes) are translated -- I just wanted to show you that this sort of thing happens so you won't be confused the first time you see a URL with all sorts of what seem like garbage characters in it.

           Table 6.2: Some Reserved Characters in URLs

        Character            Escape code replacement

        =                    %3D
        ;                    %3B
        /                    %2F
        #                    %23
        ?                    %3F
        :                    %3A
        space                %20
        ~                    %7E

A Self-Addressed Internet...

I believe that I promised early on in this book that there would be no quiz, but if I were going to break my promise, this is probably the chapter I'd do it in. You cannot get around on the Internet unless you understand how machine names and email addresses are put together. And, as the World Wide Web continues on its steamroller path to become the most popular of Internet services, a working knowledge of URLs, no matter how ugly they may seem to you now, is absolutely essential if you're to understand where you're going and what you're seeing.